Abstract. This paper presents a novel approach for the integration of a 
set of XML Schemas. The proposed approach is specialized for XML, is 
almost automatic, semantic and "light". As a further, original, peculiar- 
ity, it is parametric w.r.t. a "severity" level against which the integra- 
tion task is performed. The paper describes the approach in all details, 
illustrates various theoretical results, presents the experiments we have 
performed for testing it and, finally, compares it with various related 
approaches already proposed in the literature. 

1 Introduction 

The Web is presently playing a key role for both the publication and the ex- 
change of information among organizations. As a matter of fact, it is becoming 
the reference infrastructure for most of the applications conceived to handle 
interoperability among partners. 

In order to make Web activities easier, W3C (World Wide Web Consortium) 
proposed XML (eXtensible Markup Language) as a new standard information 
exchange language that unifies representation capabilities, typical of HTML, and 
data management features, typical of classical DBMS. 

The twofold nature of XML allowed it to gain a great success and, presently, 
most of the new documents published on the Web are written in XML. However, 
from the data management point of view, XML documents alone have limited 
and primitive capabilities. In order to improve these capabilities, in such a way 
to make them similar to those typical of classical DBMS, W3C proposed to asso- 
ciate XML Schemas with XML documents. An XML Schema can be considered 
as a sort of catalogue of the information typologies that can be found in the 
corresponding XML documents; from another point of view, an XML Schema 
defines a reference context for the corresponding XML documents. 

Certainly, XML exploitation is a key step for improving the interoperability of 
Web information sources; however, that alone is not enough to completely fulfill 
such a task. Indeed, the heterogeneity of data exchanged over the Web regards 



not only their formats but also their semantics. The use of XML allows format 
heterogeneity to be faced; the exploitation of XML Schemas allows the definition 
of a reference context for exchanged data and is a first step for handling semantic 
diversities; however, for a complete and satisfactory management of these last, 
an integration activity is necessary. 

This paper provides a contribution in this setting and proposes an approach 
for the integration of a set of XML Schemas. Our approach behaves as follows: 
first it determines interscheme properties [2,7,11,17,15], i.e., terminological and 
structural relationships holding among attributes and elements belonging to in- 
volved XML Schemas. After this, some of the derived properties are exploited 
for modifying involved Schemas in order to make them structurally and seman- 
tically uniform. The modified Schemas are, finally, integrated for obtaining the 
global Schema. 

Let us now examine the peculiarities of our approach in more detail. First, 
it has been specifically conceived for operating on XML sources. In this sense it 
differs from many other approaches already presented in the literature which in- 
tegrate information sources having different formats and structure degrees (e.g., 
relational databases, XML documents, object-oriented sources and so on). Gen- 
erally, such approaches translate all involved information sources into a common 
representation and, then, carry out the integration activity. On the contrary, our 
approach is specialized for integrating XML Schemas. With regard to this, it is 
worth pointing out that: (i) the integration of XML Schemas will play a more 
and more relevant role in the future; (ii) the exploitation of generic approaches 
designed to operate on information sources with different formats, for perform- 
ing the integration of a set of XML Schemas (i.e., a set of sources having the 
same format), is unnecessarily expensive and inefficient. Indeed, it would require 
the translation of involved XML Schemas in another format and the translation 
of the integrated source from such a format back to XML. 

Our approach is almost automatic; in this sense it follows the present trend 
relative to integration techniques. Indeed, owing to the enormous increase of the 
number of available information sources, all integration approaches proposed in 
the last years are semi-automatic; generally, they require the human intervention 
for both a pre-processing phase and the validation of obtained results. The over- 
whelming amount of sources available on the Web leads each integration task to 
operate on a great number of sources; this requires a further effort for conceiv- 
ing more automatic approaches. The approach we are proposing here provides 
a contribution in this setting since it is almost automatic and requires the user 
intervention only for validating obtained results. 

Our approach is "light"; with regard to this we observe that most of the 
existing approaches are quite complex, based on a variety of thresholds, weights, 
parameters and so on; they are very precise but difficult to be applied and fine 
tuned when involved sources are numerous, complex and belonging to hetero- 
geneous contexts. Our approach does not exploit any threshold or weight; as a 
consequence, it is simple and light, since it does not need a tuning activity. 



Our approach is semantic in that it follows the general trend to take into 
account the semantics of concepts belonging to involved information sources 
during the integration task [2,4,7,11]. Given two concepts belonging to different 
information sources, one of the most common ways of determining their 
semantics consists of examining their neighborhoods since the concepts and the 
relationships which they are involved in contribute to define their meaning. As a 
consequence, two concepts, belonging to different information sources, are con- 
sidered semantically similar and are merged in the integrated source if their 
neighborhoods are similar. 

We argue that all the peculiarities we have examined above are extremely im- 
portant for a novel approach devoted to integrate XML Schemas. However, the 
approach we are proposing here is characterized by a further feature that, in our 
opinion, is extremely innovative and promising; more specifically, it allows the 
choice of the "severity level" against which the integration task is performed. 
Such a feature derives from the consideration that applications and scenarios 
possibly benefiting from an integration task on the Web are numerous and 
extremely varied. In some situations (e.g., in Public Administrations, Finance 
and so on) the integration process must be very severe in that two concepts 
must be merged only if they are strongly similar; in such a case a high severity 
degree is required. In other situations (e.g., tourist Web pages) the integration 
task can be looser and can decide to merge two concepts having some similarities 
but presenting also some differences. At the beginning of the integration activity 
our approach asks the user to specify the desired "severity" degree; this is the 
only information required to her/him until the end of the integration task, when 
she/he has to validate obtained results. It is worth pointing out that, to the best 
of our knowledge, no approaches handling the information source integration at 
various "severity" levels have been previously presented in the literature. Inter- 
estingly enough, a classical approach can be seen as a particular case of that 
presented in this paper in which a severity level is fixed and all concept merges 
are performed w.r.t. this level. 



2 Neighborhood Construction 

In this section we formally introduce the concept of neighborhood of an element 
or an attribute of an XML Schema. As pointed out in the Introduction, this 
concept plays a key role in the various algorithms which our approach consists 
of. Preliminarily we introduce the concept of x-component which allows both 
elements and attributes of an XML document to be uniformly handled. 

Definition 1. Let $S$ be an XML Schema; an x-component of $S$ is either an 
element or an attribute of $S$. □ 

An x-component is characterized by its name, its typology (indicating if it is 
either a complex element or a simple element or an attribute) and its data type. 

Definition 2. Let S be an XML Schema; the set of its x-components is denoted 
as XCompSet(S). □ 



We now introduce some boolean functions that allow us to determine the strength 
of the relationship existing between two x-components $x_S$ and $x_T$ of an XML 
Schema $S$. They will be exploited for deriving interscheme properties and, 
ultimately, for integrating XML Schemas. The functions are: 

- $veryclose(x_S, x_T)$, that returns true if and only if: (i) $x_T = x_S$, or 
  (ii) $x_T$ is an attribute of $x_S$, or (iii) $x_T$ is a simple sub-element 
  of $x_S$; 

- $close(x_S, x_T)$, that returns true if and only if (i) $x_T$ is a complex 
  sub-element of $x_S$, or (ii) $x_T$ is an element of $S$ and $x_S$ has an 
  IDREF or an IDREFS attribute referring to $x_T$; 

- $near(x_S, x_T)$, that returns true if and only if either 
  $veryclose(x_S, x_T) = true$ or $close(x_S, x_T) = true$; in all the other 
  cases it returns false; 

- $reachable(x_S, x_T)$, that returns true if and only if there exists a 
  sequence of distinct x-components $x_1, x_2, \ldots, x_n$ such that 
  $x_S = x_1$, $near(x_1, x_2) = near(x_2, x_3) = \ldots = near(x_{n-1}, x_n) = true$, 
  $x_n = x_T$. 
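The four boolean functions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SchemaModel` class, its fields and the tiny `shop` instance are hypothetical helpers introduced here (they are not part of the approach), and x-components are identified by their names alone.

```python
from collections import deque


class SchemaModel:
    """Toy in-memory view of an XML Schema (hypothetical helper)."""

    def __init__(self):
        self.typology = {}       # x-component name -> typology
        self.children = {}       # element -> attributes and sub-elements it declares
        self.idref_targets = {}  # element -> elements referred to by its IDREF(S) attributes

    def add(self, name, typology, parent=None):
        self.typology[name] = typology
        self.children.setdefault(name, set())
        self.idref_targets.setdefault(name, set())
        if parent is not None:
            self.children[parent].add(name)

    def add_idref(self, source, target):
        self.idref_targets[source].add(target)


def veryclose(s, xs, xt):
    # Same component, or xt is an attribute / simple sub-element of xs.
    is_simple_child = (xt in s.children[xs]
                       and s.typology[xt] in ("attribute", "simple element"))
    return xs == xt or is_simple_child


def close(s, xs, xt):
    # xt is a complex sub-element of xs, or xs refers to xt via IDREF(S).
    is_complex_child = xt in s.children[xs] and s.typology[xt] == "complex element"
    return is_complex_child or xt in s.idref_targets[xs]


def near(s, xs, xt):
    return veryclose(s, xs, xt) or close(s, xs, xt)


def reachable(s, xs, xt):
    # Breadth-first search along "near" arcs starting from xs.
    seen, queue = {xs}, deque([xs])
    while queue:
        current = queue.popleft()
        if current == xt:
            return True
        for nxt in s.children[current] | s.idref_targets[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# A small fragment inspired by the shop example (assumed data).
shop = SchemaModel()
shop.add("customer", "complex element")
shop.add("SSN", "attribute", parent="customer")
shop.add("firstName", "simple element", parent="customer")
shop.add("musicAcquirement", "complex element", parent="customer")
shop.add("music", "complex element")
shop.add_idref("musicAcquirement", "music")
```

In this fragment `music` is not near `customer`, but it is reachable from it through `musicAcquirement`.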

We are now able to compute the connection cost from $x_S$ to $x_T$. 

Definition 3. Let $S$ be an XML Schema and let $x_S$ and $x_T$ be two 
x-components of $S$. The Connection Cost from $x_S$ to $x_T$, denoted by 
$CC(x_S, x_T)$, is defined as: 

$$
CC(x_S, x_T) =
\begin{cases}
0      & \text{if } veryclose(x_S, x_T) = true \\
1      & \text{if } close(x_S, x_T) = true \\
C_{ST} & \text{if } reachable(x_S, x_T) = true \text{ and } near(x_S, x_T) = false \\
\infty & \text{if } reachable(x_S, x_T) = false
\end{cases}
$$

where $C_{ST} = \min_{x_A} (CC(x_S, x_A) + CC(x_A, x_T))$ for each $x_A$ such 
that $reachable(x_S, x_A) = reachable(x_A, x_T) = true$. 

We are now provided with all tools necessary to define the concept of 
neighborhood of an x-component. 

Definition 4. Let $S$ be an XML Schema and let $x_S$ be an x-component of $S$. 
The $j^{th}$ neighborhood of $x_S$ is defined as: 

$$neighborhood(x_S, j) = \{x_T \mid x_T \in XCompSet(S),\ CC(x_S, x_T) \leq j\}$$

The construction of all neighborhoods can be easily carried out with the support 
of the data structure introduced in the next definition. 

Definition 5. Let $D$ be an XML document and let $S$ be the corresponding XML 
Schema. The XS-Graph relative to $S$ and $D$ is an oriented labeled graph defined 
as $XG(S, D) = (N(S), A(S, D))$. Here, $N(S)$ is the set of nodes of $XG(S, D)$; 
there is a node in $XG(S, D)$ for each x-component of $S$. $A(S, D)$ is the set 
of arcs of $XG(S, D)$; there is an arc $(N_S, N_T, f_{ST})$ in $XG(S, D)$ for each 
pair $(x_S, x_T)$ such that $near(x_S, x_T) = true$; in particular, $N_S$ (resp., 
$N_T$) is the node of $XG(S, D)$ corresponding to $x_S$ (resp., $x_T$) and 
$f_{ST} = CC(x_S, x_T)$. □ 



The following proposition measures the computational complexity of the con- 
struction of XG(S,D). 



Proposition 1. Let $D$ be an XML document and let $S$ be the corresponding 
XML Schema. Let $n$ be the number of x-components of $S$ and let $N_{inst}$ be 
the number of instances of $D$. The worst case time complexity for constructing 
$XG(S, D)$ from $S$ and $D$ is $O(\max\{n, N_{inst}^2\})$. □ 

With regard to this result we observe that, in an XML document, in order to 
determine the element which an IDREFS attribute refers to, it is necessary to 
examine the document, since neither the DTD nor the XML Schema provides such 
information. As a consequence, the dependency of the computational complexity 
on $N_{inst}$ cannot be avoided. However, we point out that the quadratic 
dependency on $N_{inst}$ is mainly a theoretical result; indeed, it derives from 
the consideration that each IDREFS attribute could refer to $N_{inst}$ components. 
Actually, in real situations, each IDREFS attribute refers to a very limited 
number of instances; as a consequence, the dependency of the computational 
complexity on $N_{inst}$ is generally linear. 

The next theorem determines the worst case time complexity for computing 
all neighborhoods of all x-components of an XML Schema S. 

Theorem 1. Let $XG(S, D)$ be the XS-Graph associated with an XML document 
$D$ and an XML Schema $S$ and let $n$ be the number of x-components of $S$. 
The worst case time complexity for computing all neighborhoods of all 
x-components of $S$ is $O(n^3)$. □ 
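As an illustration of how all connection costs, and hence all neighborhoods, might be obtained within a cubic bound, the sketch below runs a Floyd-Warshall style all-pairs shortest path computation over the arcs of an XS-Graph. The choice of Floyd-Warshall is an assumption made here for illustration; the text does not state which algorithm is used. The node names and arc labels are a small assumed fragment of a shop-like schema.

```python
import math


def connection_costs(nodes, arcs):
    """All-pairs connection costs.

    `arcs` maps (xs, xt) to 0 (veryclose) or 1 (close) for every pair
    with near(xs, xt) = true; unreachable pairs keep an infinite cost.
    """
    cc = {(a, b): math.inf for a in nodes for b in nodes}
    for a in nodes:
        cc[(a, a)] = 0  # veryclose(x, x) holds by definition
    for (a, b), label in arcs.items():
        cc[(a, b)] = min(cc[(a, b)], label)
    # Relax every pair through each intermediate x-component x_A.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                through_k = cc[(i, k)] + cc[(k, j)]
                if through_k < cc[(i, j)]:
                    cc[(i, j)] = through_k
    return cc


def neighborhood(cc, nodes, xs, j):
    """The j-th neighborhood: all x-components whose cost from xs is at most j."""
    return {xt for xt in nodes if cc[(xs, xt)] <= j}


# Assumed fragment: customer owns SSN and firstName, contains
# musicAcquirement, which refers to music through an IDREFS attribute.
nodes = ["customer", "SSN", "firstName", "musicAcquirement",
         "acquirementDate", "music", "title"]
arcs = {
    ("customer", "SSN"): 0,
    ("customer", "firstName"): 0,
    ("customer", "musicAcquirement"): 1,
    ("musicAcquirement", "acquirementDate"): 0,
    ("musicAcquirement", "music"): 1,
    ("music", "title"): 0,
}
cc = connection_costs(nodes, arcs)
level0 = neighborhood(cc, nodes, "customer", 0)
level1 = neighborhood(cc, nodes, "customer", 1)
```

Here `level0` contains only `customer` and its attributes and simple sub-elements, while `level1` also includes `musicAcquirement` and `acquirementDate`.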

Example 1. Consider the XML Schema $S_1$, shown in Figure 1, representing a 
shop. Here customer is an x-component and its typology is "complex element" 
since it is an element declared with a "complex type". Analogously SSN is an 
x-component, its typology is "attribute" and its data type is "string". All the 
other x-components of $S_1$, the corresponding typologies and data types can be 
determined similarly. 

In $S_1$, $veryclose(customer, firstName) = true$ because firstName is a simple 
sub-element of customer; analogously $veryclose(customer, SSN) = true$ and 
$close(customer, musicAcquirement) = true$. As for neighborhoods, we have 
that: 

$neighborhood(customer, 0) = \{customer, SSN, firstName, lastName, address, gender, birthDate, profession\}$ 

All the other neighborhoods can be determined similarly. □ 

3 Extraction of interscheme properties 

In this section we illustrate an approach for computing interscheme properties 
among x-components belonging to different XML Schemas. As pointed out in the 
Introduction, their knowledge is crucial for the integration task. The interscheme 
properties considered in this paper are synonymies and homonymies. Given two 
x-components $x_A$ and $x_B$ belonging to different XML Schemas, a synonymy 
between $x_A$ and $x_B$ indicates that they represent the same concept; a homonymy 



<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Definition of attributes -->
  <xs:attribute name="SSN" type="xs:string"/>
  <xs:attribute name="code" type="xs:ID"/>
  <xs:attribute name="acquiredBooks" type="xs:IDREFS"/>
  <xs:attribute name="acquiredMusics" type="xs:IDREFS"/>
  <xs:attribute name="acquirementDate" type="xs:date"/>
  <!-- Definition of simple elements -->
  <xs:element name="firstName" type="xs:string"/>
  <xs:element name="lastName" type="xs:string"/>
  <xs:element name="address" type="xs:string"/>
  <xs:element name="gender" type="xs:string"/>
  <xs:element name="birthDate" type="xs:date"/>
  <xs:element name="profession" type="xs:string"/>
  <xs:element name="artist" type="xs:string"/>
  <xs:element name="author" type="xs:string"/>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="pubYear" type="xs:integer"/>
  <xs:element name="publisher" type="xs:string"/>
  <xs:element name="genre" type="xs:string"/>
  <xs:element name="support" type="xs:string"/>
  <!-- Definition of complex elements -->
  <xs:element name="bookAcquirement">
    <xs:complexType>
      <xs:attribute ref="acquirementDate"/>
      <xs:attribute ref="acquiredBooks"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="musicAcquirement">
    <xs:complexType>
      <xs:attribute ref="acquirementDate"/>
      <xs:attribute ref="acquiredMusics"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
        <xs:element ref="address"/>
        <xs:element ref="gender"/>
        <xs:element ref="birthDate"/>
        <xs:element ref="profession"/>
        <xs:element ref="bookAcquirement"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="musicAcquirement"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="SSN" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="music">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="artist" maxOccurs="unbounded"/>
        <xs:element ref="title"/>
        <xs:element ref="pubYear"/>
        <xs:element ref="genre"/>
        <xs:element ref="support"/>
      </xs:sequence>
      <xs:attribute ref="code" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="author" maxOccurs="unbounded"/>
        <xs:element ref="title"/>
        <xs:element ref="publisher"/>
        <xs:element ref="pubYear"/>
        <xs:element ref="genre"/>
      </xs:sequence>
      <xs:attribute ref="code" use="required"/>
    </xs:complexType>
  </xs:element>
  <!-- Definition of root element -->
  <xs:element name="shop">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="customer" maxOccurs="unbounded"/>
        <xs:element ref="music" maxOccurs="unbounded"/>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>



Fig. 1. The XML Schema $S_1$ 



between $x_A$ and $x_B$ denotes that they indicate different concepts yet having the 
same name. 

Our technique for computing interscheme properties is semantic [2,7,15] in 
that, in order to determine the meaning of an x-component, it examines the 
"context" in which it has been defined. It requires the presence of a thesaurus 
storing lexical synonymies existing among the terms of a language. In particular, 
it exploits the English language and WordNet¹ [13]. The technique first extracts 
all synonymies and, then, exploits them for deriving homonymies. 



3.1 Derivation of synonymies 

As previously pointed out, in order to verify if two x-components $x_{1_j}$, 
belonging to an XML Schema $S_1$, and $x_{2_k}$, belonging to an XML Schema $S_2$, 
are synonymous, it is necessary to examine their neighborhoods. In particular, 
our approach operates as follows. 

First it considers $neighborhood(x_{1_j}, 0)$ and $neighborhood(x_{2_k}, 0)$ and 
determines if they are similar. This decision is made by computing the objective 
function associated with the maximum weight matching of a suitable bipartite 
graph constructed from the x-components of $neighborhood(x_{1_j}, 0)$ and 
$neighborhood(x_{2_k}, 0)$ and their lexical synonymies as stored in the thesaurus 
(see below for all details). If $neighborhood(x_{1_j}, 0)$ and 
$neighborhood(x_{2_k}, 0)$ are similar it is possible to conclude that $x_{1_j}$ 
and $x_{2_k}$ are synonymous [2,15]. However, observe that 
$neighborhood(x_{1_j}, 0)$ (resp., $neighborhood(x_{2_k}, 0)$) takes into account 
only attributes and simple elements of $x_{1_j}$ (resp., $x_{2_k}$); therefore, it 
considers quite a limited context. As a consequence, the synonymy between 
$x_{1_j}$ and $x_{2_k}$ derived in this case is more "syntactic" than "semantic" 
[7,2,15]. 

1 Actually, in the prototype implementing our technique, WordNet is accessed by a 
suitable API. 

If we need a more "severe" level of synonymy detection it is necessary to 
require not only the similarity of $neighborhood(x_{1_j}, 0)$ and 
$neighborhood(x_{2_k}, 0)$ but also that of the other neighborhoods of $x_{1_j}$ 
and $x_{2_k}$. More specifically, it is possible to introduce a "severity" level 
$u$ at which synonymies are derived and to say that $x_{1_j}$ and $x_{2_k}$ are 
synonymous with severity level equal to $u$ if $neighborhood(x_{1_j}, v)$ is 
similar to $neighborhood(x_{2_k}, v)$ for each $v$ less than or equal to $u$. The 
following proposition states an upper bound to the severity level that can be 
specified for x-component synonymy derivation. 

Proposition 2. Let $S_1$ and $S_2$ be two XML documents; let $x_{1_j}$ (resp., 
$x_{2_k}$) be an x-component of $S_1$ (resp., $S_2$); finally let $m$ be the 
maximum between the number of complex elements of $S_1$ and $S_2$. The maximum 
severity level possibly existing for the synonymy between $x_{1_j}$ and $x_{2_k}$ 
is $m - 1$. □ 

A function synonymous can be defined which receives two x-components $x_{1_j}$ 
and $x_{2_k}$ and an integer $u$ and returns true if $x_{1_j}$ and $x_{2_k}$ are 
synonymous with a severity level equal to $u$, false otherwise. 

As previously pointed out, computing the synonymy between two x-components 
$x_{1_j}$ and $x_{2_k}$ implies determining when two neighborhoods are similar. In 
order to carry out such a task, it is necessary to compute the objective function 
associated with the maximum weight matching relative to a specific bipartite 
graph obtained from the x-components of the neighborhoods under consideration. 

More specifically, let $BG(x_{1_j}, x_{2_k}, u) = (N(x_{1_j}, x_{2_k}, u), A(x_{1_j}, x_{2_k}, u))$ 
be the bipartite graph associated with $neighborhood(x_{1_j}, u)$ and 
$neighborhood(x_{2_k}, u)$ (in the following we shall use the notation $BG(u)$ 
instead of $BG(x_{1_j}, x_{2_k}, u)$ when this is not confusing). In $BG(u)$, 
$N(u) = P(u) \cup Q(u)$ represents the set of nodes; there is a node in $P(u)$ 
(resp., $Q(u)$) for each x-component of $neighborhood(x_{1_j}, u)$ (resp., 
$neighborhood(x_{2_k}, u)$). $A(u)$ is the set of arcs; there is an arc between 
$p_e \in P(u)$ and $q_f \in Q(u)$ if a synonymy between the names of the 
x-components associated with $p_e$ and $q_f$ holds in the reference thesaurus. 
The maximum weight matching for $BG(u)$ is a set $A'(u) \subseteq A(u)$ of edges 
such that, for each node $x \in P(u) \cup Q(u)$, there is at most one edge of 
$A'(u)$ incident onto $x$ and $|A'(u)|$ is maximum (for algorithms solving the 
maximum weight matching problem, see [8]). The objective function we associate 
with the maximum weight matching is: 

$$\phi_{BG}(u) = \frac{2|A'(u)|}{|P(u)| + |Q(u)|}$$

We assume that if $\phi_{BG}(u) > \frac{1}{2}$ then $neighborhood(x_{1_j}, u)$ and 
$neighborhood(x_{2_k}, u)$ are similar; otherwise they are dissimilar. Such an 
assumption derives from the consideration that two sets of objects can be 
considered similar if the number of similar components is greater than the 
number of the dissimilar ones or, in other words, if the number of similar 
components is greater than half of the total number of components. 
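The similarity test and the severity-aware synonymy check can be sketched as follows. This is a minimal illustration under stated assumptions: since every arc of the bipartite graph has the same weight, the matching is computed as a maximum cardinality matching with a simple augmenting-path search (the text does not prescribe a specific algorithm); the objective is taken to be twice the matching size divided by the total number of nodes, which agrees with the "more than half" criterion above; and the thesaurus entry linking `customer` and `client` is assumed for the example rather than taken from WordNet.

```python
def matching_size(p_nodes, q_nodes, are_synonyms):
    """Size of a maximum matching between two lists of x-component names."""
    match_of_q = {}

    def try_augment(p, visited):
        for q in q_nodes:
            if q not in visited and are_synonyms(p, q):
                visited.add(q)
                # Take q if it is free, or if its current partner can move.
                if q not in match_of_q or try_augment(match_of_q[q], visited):
                    match_of_q[q] = p
                    return True
        return False

    return sum(1 for p in p_nodes if try_augment(p, set()))


def phi_bg(p_nodes, q_nodes, are_synonyms):
    total = len(p_nodes) + len(q_nodes)
    if total == 0:
        return 0.0
    return 2 * matching_size(p_nodes, q_nodes, are_synonyms) / total


def similar(p_nodes, q_nodes, are_synonyms):
    return phi_bg(p_nodes, q_nodes, are_synonyms) > 0.5


def synonymous(x1, x2, u, neighborhood1, neighborhood2, are_synonyms):
    """True if the v-th neighborhoods are similar for every v <= u."""
    return all(
        similar(neighborhood1(x1, v), neighborhood2(x2, v), are_synonyms)
        for v in range(u + 1)
    )


# Assumed thesaurus: identical names, plus one hand-written synonym pair.
THESAURUS = {("customer", "client")}


def are_synonyms(a, b):
    return a == b or (a, b) in THESAURUS or (b, a) in THESAURUS


customer_0 = ["customer", "SSN", "firstName", "lastName", "address",
              "gender", "birthDate", "profession"]
client_0 = ["client", "SSN", "firstName", "lastName", "address",
            "phone", "email"]
score = phi_bg(customer_0, client_0, are_synonyms)
```

With these two 0-th neighborhoods, five pairs are matched out of 8 + 7 components, so `score` is 10/15, above the one-half threshold.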

We present now the following theorem stating the computational complexity 
of the x-components' similarity extraction. 

Theorem 2. Let $S_1$ and $S_2$ be two XML documents. Let $x_{1_j}$ (resp., 
$x_{2_k}$) be an x-component of $S_1$ (resp., $S_2$). Let $u$ be the selected 
severity level. Finally, let $p$ be the maximum between the cardinality of 
$neighborhood(x_{1_j}, u)$ and $neighborhood(x_{2_k}, u)$. The worst case time 
complexity for computing $synonymous(x_{1_j}, x_{2_k}, u)$ is 
$O((u + 1) \times p^3)$. □ 

Corollary 1. Let $S_1$ and $S_2$ be two XML documents. Let $u$ be the severity 
level. Let $m$ be the maximum between the number of complex elements of $S_1$ 
and $S_2$. Finally, let $q$ be the maximum cardinality relative to a neighborhood 
of $S_1$ or $S_2$. The worst case time complexity for deriving all synonymies 
existing, at the severity level $u$, between $S_1$ and $S_2$ is 
$O((u + 1) \times q^3 \times m^2)$. □ 



3.2 Derivation of homonymies 

After synonymies among x-components of $S_1$ and $S_2$ have been extracted, 
homonymies can be directly derived from them. More specifically, we say that 
a homonymy holds between $x_{1_j}$ and $x_{2_k}$ with a severity level equal to 
$u$ if $synonymous(x_{1_j}, x_{2_k}, u) = false$ and both $x_{1_j}$ and $x_{2_k}$ 
have the same name. 

It is possible to define a boolean function homonymous, which receives two 
x-components $x_{1_j}$ and $x_{2_k}$ and an integer $u$ and returns true if there 
exists a homonymy between $x_{1_j}$ and $x_{2_k}$ with a severity level equal to 
$u$; homonymous returns false otherwise. 

Example 2. Consider the XML Schemas $S_1$ and $S_2$, shown in Figures 1 and 2. 
Consider also the x-components $customer_{[S_1]}$ and $client_{[S_2]}$ ². In order 
to check if they are synonymous with a severity level 0, it is necessary to 
compute the function $synonymous(customer_{[S_1]}, client_{[S_2]}, 0)$. Now, 
$neighborhood(customer_{[S_1]}, 0)$ has been shown in Example 1; as for 
$neighborhood(client_{[S_2]}, 0)$, we have: 

$neighborhood(client_{[S_2]}, 0) = \{client_{[S_2]}, SSN_{[S_2]}, firstName_{[S_2]}, lastName_{[S_2]}, address_{[S_2]}, phone_{[S_2]}, email_{[S_2]}\}$ 

The function $\phi_{BG}(0)$ computed by synonymous in this case is 
$\frac{2|A'(0)|}{|P(0)| + |Q(0)|} = 0.67 > \frac{1}{2}$; therefore 
$synonymous(customer_{[S_1]}, client_{[S_2]}, 0) = true$. 
In an analogous way, $synonymous(customer_{[S_1]}, client_{[S_2]}, 1)$ can be 
computed. In particular, in this case, $\phi_{BG}(1) = 0.43 < \frac{1}{2}$; as a 
consequence, 



2 Here and in the following, we use the notation $x_{[S]}$ to indicate the 
x-component $x$ of the XML Schema $S$. 



<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Definition of attributes -->
  <xs:attribute name="SSN" type="xs:string"/>
  <xs:attribute name="code" type="xs:ID"/>
  <xs:attribute name="purchasedCDDAs" type="xs:IDREFS"/>
  <xs:attribute name="purchasedMiniDisks" type="xs:IDREFS"/>
  <xs:attribute name="purchaseDate" type="xs:date"/>
  <xs:attribute name="quantity" type="xs:integer"/>
  <xs:attribute name="bitRate" type="xs:integer"/>
  <!-- Definition of simple elements -->
  <xs:element name="firstName" type="xs:string"/>
  <xs:element name="lastName" type="xs:string"/>
  <xs:element name="address" type="xs:string"/>
  <xs:element name="phone" type="xs:string"/>
  <xs:element name="email" type="xs:string"/>
  <xs:element name="artist" type="xs:string"/>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="song" type="xs:string"/>
  <xs:element name="year" type="xs:integer"/>
  <xs:element name="genre" type="xs:string"/>
  <!-- Definition of complex elements -->
  <xs:element name="CDDAPurchase">
    <xs:complexType>
      <xs:attribute ref="purchaseDate"/>
      <xs:attribute ref="purchasedCDDAs"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="miniDiskPurchase">
    <xs:complexType>
      <xs:attribute ref="purchaseDate"/>
      <xs:attribute ref="purchasedMiniDisks"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="client">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
        <xs:element ref="address"/>
        <xs:element ref="phone" minOccurs="0"
                    maxOccurs="unbounded"/>
        <xs:element ref="email" minOccurs="0"
                    maxOccurs="unbounded"/>
        <xs:element ref="CDDAPurchase" minOccurs="0"
                    maxOccurs="unbounded"/>
        <xs:element ref="miniDiskPurchase" minOccurs="0"
                    maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="SSN" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="CDDA">
    <xs:complexType>
      <xs:attribute ref="code" use="required"/>
      <xs:attribute ref="quantity"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="miniDisk">
    <xs:complexType>
      <xs:attribute ref="code" use="required"/>
      <xs:attribute ref="quantity"/>
      <xs:attribute ref="bitRate"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="composition">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="artist" maxOccurs="unbounded"/>
        <xs:element ref="title"/>
        <xs:element ref="song" maxOccurs="unbounded"/>
        <xs:element ref="year"/>
        <xs:element ref="genre"/>
        <xs:element ref="CDDA" minOccurs="0"/>
        <xs:element ref="miniDisk" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- Definition of root element -->
  <xs:element name="store">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="client" maxOccurs="unbounded"/>
        <xs:element ref="composition" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>



Fig. 2. The XML Schema $S_2$ 



$synonymous(customer_{[S_1]}, client_{[S_2]}, 1) = false$, i.e., $customer_{[S_1]}$ 
and $client_{[S_2]}$ cannot be considered synonymous with a severity level 1. 

All the other synonymies can be derived analogously. As for these Schemas, 
no homonymy has been found. □ 



4 The Integration Task 

In this section we propose an integration algorithm which receives two XML 
Schemas $S_1$ and $S_2$ and a severity level $u$ and returns the integrated XML 
Schema $S_G$. The algorithm consists of two steps, namely: (i) construction of a 
Merge Dictionary $MD(u)$ and a Rename Dictionary $RD(u)$; (ii) exploitation of 
$MD(u)$ and $RD(u)$ for obtaining the global Schema. 

Preliminarily it is necessary to observe that in XML Schemas there exists 
a large variety of data types. Some of them, e.g. Byte and Int, are compatible 
in the sense that each attribute or simple element whose type is Byte can be 
treated as an attribute or a simple element whose type is Int; in this case Int is 
said to be more general than Byte. Other types, e.g. Int and Date, are not 
compatible. Compatibility rules are analogous to the corresponding ones valid 
for high level programming languages. 
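A compatibility check of this kind might be sketched as below. The chain used here follows the derivation order of some XML Schema built-in numeric types (byte, short, int, long, integer, decimal); restricting the check to this single chain, and the function name `more_general`, are simplifying assumptions made for illustration, not the compatibility rules actually adopted by the approach.

```python
# Built-in numeric types ordered from the most specific to the most general.
NUMERIC_CHAIN = ["byte", "short", "int", "long", "integer", "decimal"]


def more_general(type1, type2):
    """Return the more general of two compatible types, or None if incompatible."""
    if type1 == type2:
        return type1
    if type1 in NUMERIC_CHAIN and type2 in NUMERIC_CHAIN:
        # The type appearing later in the chain can hold every value of the other.
        return max(type1, type2, key=NUMERIC_CHAIN.index)
    return None


byte_vs_int = more_general("byte", "int")
int_vs_date = more_general("int", "date")
```

In this sketch `byte` and `int` are compatible, with `int` the more general, whereas `int` and `date` are reported as incompatible.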



4.1 Construction of MD(u) and RD(u) 



At the end of interscheme property derivation, it could happen that an 
x-component of a Schema is synonymous (resp., homonymous) with more than one 
x-component of the other Schema. The integration algorithm we are proposing 
here needs each x-component of a Schema to be synonymous (resp., homonymous) 
with at most one x-component of the other Schema. In order to satisfy 
this requirement, it is necessary to construct a Merge Dictionary $MD(u)$ and a 
Rename Dictionary $RD(u)$ by suitably filtering previously derived synonymies 
and homonymies. 

The construction of $MD(u)$ begins with the definition of a support bipartite 
graph $SimG(u) = (SimNSet_1(u) \cup SimNSet_2(u), SimASet(u))$. 

There is a node $n_{1_j}$ (resp., $n_{2_k}$) in $SimNSet_1(u)$ (resp., 
$SimNSet_2(u)$) for each complex element $E_{1_j}$ (resp., $E_{2_k}$) belonging to 
$S_1$ (resp., $S_2$). There is an arc $A_{jk} = (n_{1_j}, n_{2_k}) \in SimASet(u)$ 
if $synonymous(E_{1_j}, E_{2_k}, u) = true$; the label of each arc $A_{jk}$ is 
$f(n_{1_j}, n_{2_k})$ where: 

$$
f(n_{1_j}, n_{2_k}) =
\begin{cases}
\phi_{BG}(E_{1_j}, E_{2_k}, u) & \text{if } A_{jk} \in SimASet(u) \\
0 & \text{otherwise}
\end{cases}
$$

Function $f$ has been defined in such a way to maximize the sum of the similarity 
degrees involving complex elements of $S_1$ and $S_2$. 

After this, a maximum weight matching is computed on $SimG(u)$; this selects 
a subset $SimASubSet(u) \subseteq SimASet(u)$ which maximizes the objective 
function $\phi_{Sim}(u) = \sum_{(n_{1_j}, n_{2_k}) \in SimASubSet(u)} f(n_{1_j}, n_{2_k})$. 

For each arc $A'_{jk} = (n'_{1_j}, n'_{2_k}) \in SimASubSet(u)$ a pair 
$(E'_{1_j}, E'_{2_k})$ is added to $MD(u)$. 
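The filtering step above can be illustrated with a small sketch that selects, among the synonymous pairs of complex elements, a one-to-one subset with maximum total similarity. The exhaustive search used here is an assumption chosen for brevity and is only practical for a handful of elements; a dedicated maximum weight matching algorithm would be used for larger inputs. The element names and similarity degrees are invented for the example.

```python
def max_weight_matching(elems1, elems2, weight):
    """Pick one-to-one pairs maximizing the sum of weight[(e1, e2)].

    `weight` holds the similarity degree of each synonymous pair;
    pairs that are absent are treated as having no arc.
    """
    best_total, best_pairs = 0.0, []

    def search(i, used, total, pairs):
        nonlocal best_total, best_pairs
        if i == len(elems1):
            if total > best_total:
                best_total, best_pairs = total, list(pairs)
            return
        # Option 1: leave elems1[i] unmatched.
        search(i + 1, used, total, pairs)
        # Option 2: match elems1[i] with any free synonymous element.
        for e2 in elems2:
            w = weight.get((elems1[i], e2), 0.0)
            if w > 0 and e2 not in used:
                used.add(e2)
                pairs.append((elems1[i], e2))
                search(i + 1, used, total + w, pairs)
                pairs.pop()
                used.remove(e2)

    search(0, set(), 0.0, [])
    return best_pairs


# Hypothetical similarity degrees between complex elements of two Schemas.
weight = {
    ("customer", "client"): 0.67,
    ("customer", "buyer"): 0.60,
    ("patron", "client"): 0.55,
}
merge_pairs = max_weight_matching(["customer", "patron"],
                                  ["client", "buyer"], weight)
```

Note that the pair with the highest individual degree, (customer, client), is not selected here: pairing customer with buyer and patron with client yields a larger overall sum, which is exactly what maximizing the objective function implies.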

In addition, let $E'_{1_j}$ (resp., $E'_{2_k}$) be a complex element of $S_1$ 
(resp., $S_2$) such that $(E'_{1_j}, E'_{2_k}) \in MD(u)$ and let $x'_{1_j}$ (resp., 
$x'_{2_k}$) be an attribute or a simple element of $E'_{1_j}$ (resp., $E'_{2_k}$); 
then $(x'_{1_j}, x'_{2_k})$ is added to $MD(u)$ if (i) a synonymy between the name 
of $x'_{1_j}$ and that of $x'_{2_k}$ holds in the reference thesaurus and the data 
types of $x'_{1_j}$ and $x'_{2_k}$ are compatible, or (ii) $x'_{1_j}$ and $x'_{2_k}$ 
have the same name, the same typology and compatible data types. 

After $MD(u)$ has been constructed, it is possible to derive $RD(u)$. More 
specifically, a pair of x-components $(x''_{1_j}, x''_{2_k})$ is added to $RD(u)$ 
if $x''_{1_j}$ and $x''_{2_k}$ are two elements or two attributes having the same 
name and $(x''_{1_j}, x''_{2_k}) \notin MD(u)$. 
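The derivation of the Rename Dictionary can be sketched as a simple filter over pairs of x-components. In this illustration each x-component is represented as a hypothetical `(schema, name, kind)` tuple, with `kind` being either "element" or "attribute"; the sample components and the sample Merge Dictionary are invented and do not reproduce results for the Schemas of the running example.

```python
def build_rename_dictionary(components1, components2, merge_dictionary):
    """Pairs with the same name and kind that were not selected for merging."""
    rename_dictionary = set()
    for c1 in components1:
        for c2 in components2:
            same_name = c1[1] == c2[1]
            same_kind = c1[2] == c2[2]  # both elements or both attributes
            if same_name and same_kind and (c1, c2) not in merge_dictionary:
                rename_dictionary.add((c1, c2))
    return rename_dictionary


# Hypothetical x-components of two Schemas.
s1_components = [("S1", "code", "attribute"), ("S1", "title", "element")]
s2_components = [("S2", "code", "attribute"), ("S2", "title", "element")]

# Assume that only the two title elements were judged mergeable.
md = {(("S1", "title", "element"), ("S2", "title", "element"))}
rd = build_rename_dictionary(s1_components, s2_components, md)
```

Under these assumptions only the two `code` attributes end up in `rd`, since they share a name but are not in the Merge Dictionary.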

4.2 Construction of the global XML Schema 

After $MD(u)$ and $RD(u)$ have been derived, it is possible to exploit them for 
constructing a global Schema $S_G$. Our integration algorithm assumes that $S_1$ 
and $S_2$ are represented in the referenced style, i.e., that they consist of 
sequences of elements and that each element may refer to other elements by means 
of the ref attribute. Actually, an XML Schema could be defined in various other 
ways (e.g., with the inline style); however, simple rules can be easily defined 
for translating it into the referenced style (see [1] for more details on the 
various definition styles). 

More formally, $S_1$ and $S_2$ can be represented as: 

$$S_1 = \langle x_{1_1}, x_{1_2}, \ldots, x_{1_i}, \ldots, x_{1_n} \rangle; \quad S_2 = \langle x_{2_1}, x_{2_2}, \ldots, x_{2_j}, \ldots, x_{2_m} \rangle$$

where $x_{1_1}, \ldots, x_{1_n}, x_{2_1}, \ldots, x_{2_m}$ are x-components. A first, 
rough, version of $S_G$ can be obtained by constructing a list containing all the 
x-components of $S_1$ and $S_2$: 

$$S_G = \langle x_{1_1}, \ldots, x_{1_n}, x_{2_1}, \ldots, x_{2_m} \rangle$$
This version of $S_G$ could present some redundancies and/or ambiguities. In 
order to remove them and, consequently, to refine $S_G$, $MD(u)$ and $RD(u)$ must 
be examined and some tasks must be performed for each of the properties they 
store. More specifically, consider $MD(u)$ and let $(E_{1_j}, E_{2_k}) \in MD(u)$ 
be a synonymy between two complex elements. $E_{1_j}$ and $E_{2_k}$ are merged 
into a complex element $E_{jk}$. The name of $E_{jk}$ is one of the names of 
$E_{1_j}$ and $E_{2_k}$. The set of sub-elements of $E_{jk}$ is obtained by 
applying the xs:sequence indicator to the sets of sub-elements of $E_{1_j}$ and 
$E_{2_k}$; the list of attributes of $E_{jk}$ is formed by the attributes of 
$E_{1_j}$ and $E_{2_k}$. Note that, after these tasks have been carried out, it 
could happen that: 

- A tuple $(A'_{jk}, A''_{jk})$, such that $A'_{jk}$ and $A''_{jk}$ are
  attributes of $E_{jk}$, belongs to MD(u). In this case $A'_{jk}$ and
  $A''_{jk}$ are merged into an attribute $A^*_{jk}$; the name of $A^*_{jk}$ is
  either the name of $A'_{jk}$ or that of $A''_{jk}$; the type of $A^*_{jk}$ is
  the more general of the types of $A'_{jk}$ and $A''_{jk}$.

- A tuple $(E'_{jk}, E''_{jk})$, such that $E'_{jk}$ and $E''_{jk}$ are simple
  elements of $E_{jk}$, belongs to MD(u). In this case $E'_{jk}$ and $E''_{jk}$
  are merged into an element $E^*_{jk}$; the name of $E^*_{jk}$ is either the
  name of $E'_{jk}$ or that of $E''_{jk}$; the type of $E^*_{jk}$ is the more
  general of the types of $E'_{jk}$ and $E''_{jk}$; the minOccurs (resp., the
  maxOccurs) indicator of $E^*_{jk}$ is the minimum (resp., the maximum) of the
  corresponding indicators of $E'_{jk}$ and $E''_{jk}$.

- A tuple $(E'_{jk}, A''_{jk})$, such that $E'_{jk}$ is a simple sub-element of
  $E_{jk}$ and $A''_{jk}$ is an attribute of $E_{jk}$, belongs to MD(u). In this
  case, $A''_{jk}$ is removed since its information content is equivalent to
  that of $E'_{jk}$ and the representation of an information content by means of
  an element is more general than that obtained by exploiting an attribute.
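
The merge of two synonymous simple sub-elements described in the list above can be sketched as follows. This is a minimal illustration, not the authors' code: the `SimpleElement` record, the function `merge_simple` and, above all, the `GENERALITY` ranking of data types are assumptions introduced here, since the text does not spell out how "the most general type" is determined. The types and occurrence indicators used in the usage lines are made up for the example.

```python
import math
from dataclasses import dataclass

# Hypothetical generality ranking: a higher rank means a more general type.
GENERALITY = {"xs:integer": 0, "xs:decimal": 1, "xs:string": 2}


@dataclass(frozen=True)
class SimpleElement:
    name: str
    type: str
    min_occurs: int = 1
    max_occurs: float = 1  # math.inf stands for "unbounded"


def merge_simple(e1, e2):
    """Merge two synonymous simple elements into a single one."""
    most_general = max(e1.type, e2.type, key=GENERALITY.__getitem__)
    return SimpleElement(
        name=e1.name,  # either name may be kept; the first one is chosen here
        type=most_general,
        min_occurs=min(e1.min_occurs, e2.min_occurs),
        max_occurs=max(e1.max_occurs, e2.max_occurs),
    )


# Made-up inputs: the real declarations of these elements may differ.
pub_year = SimpleElement("pubYear", "xs:integer")
year = SimpleElement("year", "xs:string", min_occurs=0, max_occurs=math.inf)
merged = merge_simple(pub_year, year)
```

Taking the minimum of minOccurs and the maximum of maxOccurs yields the loosest occurrence constraint, so that instances valid for either original element remain valid for the merged one.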

After this, all references to $E_{1j}$ and $E_{2k}$ in $S_G$ are transformed into
references to $E_{jk}$; the maxOccurs and the minOccurs indicators associated
with $E_{jk}$ are derived from the corresponding ones relative to $E_{1j}$ and
$E_{2k}$ and, finally, $E_{1j}$ is replaced by $E_{jk}$ whereas $E_{2k}$ is
removed from $S_G$.
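
The merge of two synonymous complex elements and the subsequent redirection of references can be sketched as follows. This is a simplified illustration under stated assumptions: the `ComplexElement` record and the functions `merge_complex` and `redirect_references` are hypothetical, sub-elements and attributes are represented only by the names they refer to, and the content of the sample elements is abbreviated and invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexElement:
    name: str
    sub_elements: list = field(default_factory=list)  # referenced element names
    attributes: list = field(default_factory=list)    # referenced attribute names


def merge_complex(e1, e2, synonym_of):
    """Merge e2 into e1; synonym_of maps names used in e2 to their synonyms in e1."""
    def union(first, second):
        merged = list(first)
        for name in second:
            name = synonym_of.get(name, name)
            if name not in merged:  # synonymous components are kept only once
                merged.append(name)
        return merged

    return ComplexElement(
        e1.name,  # the name of the first element is kept
        union(e1.sub_elements, e2.sub_elements),
        union(e1.attributes, e2.attributes),
    )


def redirect_references(schema, old_names, new_name):
    """Make every reference to one of old_names point to new_name."""
    for element in schema:
        element.sub_elements = [
            new_name if ref in old_names else ref for ref in element.sub_elements
        ]


customer = ComplexElement("customer", ["firstName", "lastName", "gender"], ["SSN"])
client = ComplexElement("client", ["firstName", "lastName", "phone"], ["SSN"])
store = ComplexElement("store", ["client"])

merged = merge_complex(customer, client, synonym_of={})
redirect_references([store], {"customer", "client"}, merged.name)
```

The sketch deliberately leaves out type generalization and occurrence indicators, which the text treats as separate steps.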

After MD(u) has been examined, it is necessary to consider RD(u); in
particular, let $(x_{1j}, x_{2k})$ be a tuple of RD(u) such that $x_{1j}$ and
$x_{2k}$ are both elements or both attributes of the same element. In this case
it is necessary to modify the name of either $x_{1j}$ or $x_{2k}$ and all the
corresponding references.
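
The renaming of a homonymous x-component can be sketched as follows. The dictionary-based representation, the `source` tag that records the Schema of origin, the function `rename_homonym` and the `_S2` suffix are all assumptions made for this illustration; the text does not prescribe a naming scheme.

```python
def rename_homonym(components, name, source, new_name):
    """Rename the component called `name` that comes from `source`,
    together with the references to it made by components of `source`."""
    for comp in components:
        if comp["source"] != source:
            continue
        if comp["name"] == name:
            comp["name"] = new_name
        comp["refs"] = [new_name if ref == name else ref for ref in comp["refs"]]


# Made-up x-components: two homonymous "title" elements with different meanings.
components = [
    {"name": "title", "source": "S1", "refs": []},
    {"name": "title", "source": "S2", "refs": []},
    {"name": "book", "source": "S1", "refs": ["title"]},
    {"name": "job", "source": "S2", "refs": ["title"]},
]
rename_homonym(components, "title", "S2", "title_S2")
```

Only the component coming from the chosen Schema and the references made from that Schema are touched, so references from the other Schema keep pointing to the component that retains the original name.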

Observe that, after all these activities have been performed, $S_G$ could contain
two root elements. Such a situation occurs when the root elements $E_{1r}$ of
$S_1$ and $E_{2r}$ of $S_2$ are not synonymous. In this case it is necessary to
create a new root element $E_{Gr}$ in $S_G$ whose set of sub-elements is obtained
by applying the xs:all indicator to $E_{1r}$ and $E_{2r}$. The occurrence
indicators associated with $E_{1r}$ and $E_{2r}$ are minOccurs = 0 and
maxOccurs = 1.
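
The creation of the new root element can be sketched with Python's standard `xml.etree.ElementTree` module. The function `make_global_root` and the element names used below are hypothetical, and the sketch assumes that both former roots are given minOccurs = 0 and maxOccurs = 1 inside the xs:all indicator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def make_global_root(root_name, old_roots):
    """Build an xs:element whose content is an xs:all over the former roots."""
    ET.register_namespace("xs", XS)
    root = ET.Element(f"{{{XS}}}element", {"name": root_name})
    complex_type = ET.SubElement(root, f"{{{XS}}}complexType")
    all_group = ET.SubElement(complex_type, f"{{{XS}}}all")
    for name in old_roots:
        # Each former root becomes an optional child of the new root.
        ET.SubElement(
            all_group,
            f"{{{XS}}}element",
            {"ref": name, "minOccurs": "0", "maxOccurs": "1"},
        )
    return root


# Made-up root names: two Schemas whose roots are not synonymous.
root = make_global_root("globalRoot", ["shop", "library"])
xml_text = ET.tostring(root, encoding="unicode")
```

Making both children optional lets a document that conforms to only one of the original Schemas be wrapped in the new root without violating the global Schema.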

As for the computational complexity of the integration task, it is possible to 
state the following theorem. 



x-component of S1   x-component of S2   x-component of S1   x-component of S2
shop                store               customer            client
music               composition         SSN                 SSN
firstName           firstName           lastName            lastName
address             address             code                code
artist              artist              title               title
pubYear             year                genre               genre

Table 1. The Merge Dictionary MD(0)



Theorem 3. Let $S_1$ and $S_2$ be two XML Schemas, let $n$ be the maximum
between $|XCompSet(S_1)|$ and $|XCompSet(S_2)|$ and let $m$ be the maximum
between the number of complex elements of $S_1$ and the number of complex
elements of $S_2$. The worst case time complexity for integrating $S_1$ and
$S_2$ into a global Schema $S_G$ is $O(m \times n^2)$. □



Example 3. Assume a user wants to integrate the XML Schemas $S_1$ and $S_2$,
shown in Figures 1 and 2, and the severity level she/he specifies is 0. MD(0) is
illustrated in Table 1. RD(0) is empty because no homonymy has been found
among x-components of $S_1$ and $S_2$ (see Example 2). Initially a rough version
of $S_G$ is constructed that contains all the x-components of $S_1$ and $S_2$;
the refined version of $S_G$ is obtained by removing (possible) redundancies
and/or ambiguities present therein.

The first step of the refinement phase examines all synonymies among complex
elements stored in MD(0). As an example, consider the synonymous elements
$customer_{[S_1]}$ and $client_{[S_2]}$; they must be merged into one single
element. This task is carried out as follows. First a new element
$customer_{[S_G]}$ is created in $S_G$. The set of sub-elements of
$customer_{[S_G]}$ is obtained by applying the xs:sequence indicator to the sets
of sub-elements of $customer_{[S_1]}$ and $client_{[S_2]}$; the list of
attributes of $customer_{[S_G]}$ is formed by the attributes of
$customer_{[S_1]}$ and $client_{[S_2]}$. At the end of this task,
$customer_{[S_G]}$ contains two attributes named SSN. Since the tuple (SSN, SSN)
belongs to MD(0), the two attributes are merged into a single attribute SSN
having type "string". An analogous procedure is applied to the sub-element pairs
$(firstName_{[S_1]}, firstName_{[S_2]})$, $(lastName_{[S_1]}, lastName_{[S_2]})$
and $(address_{[S_1]}, address_{[S_2]})$.

After this, all references to $customer_{[S_1]}$ and $client_{[S_2]}$ are
transformed into references to $customer_{[S_G]}$; finally, $customer_{[S_1]}$
is replaced by $customer_{[S_G]}$ whereas $client_{[S_2]}$ is removed from
$S_G$. All the other synonymies stored in MD(0) are handled similarly. Since no
homonymy has been found, no further action is necessary. The global XML Schema
$S_G$, obtained at the end of the integration activity, is shown in Figure 3.



□ 



<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Definition of attributes -->
  <xs:attribute name="SSN" type="xs:string"/>
  <xs:attribute name="code" type="xs:ID"/>
  <xs:attribute name="acquiredBooks" type="xs:IDREFS"/>
  <xs:attribute name="acquiredMusics" type="xs:IDREFS"/>
  <xs:attribute name="acquirementDate" type="xs:date"/>
  <xs:attribute name="purchasedCDDAs" type="xs:IDREFS"/>
  <xs:attribute name="purchasedMiniDisks" type="xs:IDREFS"/>
  <xs:attribute name="purchaseDate" type="xs:date"/>
  <xs:attribute name="quantity" type="xs:integer"/>
  <xs:attribute name="bitRate" type="xs:integer"/>
  <!-- Definition of simple elements -->
  <xs:element name="firstName" type="xs:string"/>
  <xs:element name="lastName" type="xs:string"/>
  <xs:element name="address" type="xs:string"/>
  <xs:element name="gender" type="xs:string"/>
  <xs:element name="birthDate" type="xs:date"/>
  <xs:element name="profession" type="xs:string"/>
  <xs:element name="phone" type="xs:string"/>
  <xs:element name="email" type="xs:string"/>
  <xs:element name="artist" type="xs:string"/>
  <xs:element name="author" type="xs:string"/>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="song" type="xs:string"/>
  <xs:element name="pubYear" type="xs:integer"/>
  <xs:element name="publisher" type="xs:string"/>
  <xs:element name="genre" type="xs:string"/>
  <xs:element name="support" type="xs:string"/>
  <!-- Definition of complex elements -->
  <xs:element name="bookAcquirement">
    <xs:complexType>
      <xs:attribute ref="acquirementDate"/>
      <xs:attribute ref="acquiredBooks"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="musicAcquirement">
    <xs:complexType>
      <xs:attribute ref="acquirementDate"/>
      <xs:attribute ref="acquiredMusics"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="author" maxOccurs="unbounded"/>
        <xs:element ref="title"/>
        <xs:element ref="publisher"/>
        <xs:element ref="pubYear"/>
        <xs:element ref="genre"/>
      </xs:sequence>
      <xs:attribute ref="code" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="CDDAPurchase">
    <xs:complexType>
      <xs:attribute ref="purchaseDate"/>
      <xs:attribute ref="purchasedCDDAs"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="miniDiskPurchase">
    <xs:complexType>
      <xs:attribute ref="purchaseDate"/>
      <xs:attribute ref="purchasedMiniDisks"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
        <xs:element ref="address"/>
        <xs:element ref="gender"/>
        <xs:element ref="birthDate"/>
        <xs:element ref="profession"/>
        <xs:element ref="bookAcquirement"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="musicAcquirement"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="phone"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="email"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="CDDAPurchase"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="miniDiskPurchase"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="SSN" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="CDDA">
    <xs:complexType>
      <xs:attribute ref="code"/>
      <xs:attribute ref="quantity"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="miniDisk">
    <xs:complexType>
      <xs:attribute ref="code"/>
      <xs:attribute ref="quantity"/>
      <xs:attribute ref="bitRate"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="music">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="artist" maxOccurs="unbounded"/>
        <xs:element ref="title"/>
        <xs:element ref="pubYear"/>
        <xs:element ref="genre"/>
        <xs:element ref="support" minOccurs="0"/>
        <xs:element ref="song"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="CDDA" minOccurs="0"/>
        <xs:element ref="miniDisk" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="code"/>
    </xs:complexType>
  </xs:element>
  <!-- Definition of root element -->
  <xs:element name="shop">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="customer"
                    maxOccurs="unbounded"/>
        <xs:element ref="music"
                    maxOccurs="unbounded"/>
        <xs:element ref="book"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>



Fig. 3. The integrated XML Schema $S_G$



5 Experiments 



To test the performance of our approach we have carried out various
experiments; these have been performed on several XML Schemas taken from
different application contexts. The XML Schemas involved were very heterogeneous
in size; indeed, the number of x-components associated with them ranged from
tens to hundreds.

The first series of experiments has been conceived for measuring correctness
and completeness of our interscheme property derivation algorithm. In
particular, correctness is the fraction of the properties returned by our
techniques that agree with those provided by humans; completeness is the
fraction of the properties provided by humans that are returned by our approach.

In more detail, we proceeded as follows: (i) we ran our algorithms on several
pairs of XML Schemas and collected the returned results; (ii) for each pair of
Schemas we asked humans to specify a set of significant interscheme properties;
(iii) we computed the overall quality figures by comparing the sets of
properties obtained as described at points (i) and (ii) above.
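
Under this reading, the two quality figures can be computed as sketched below. The function names and the sample property sets are hypothetical (they are not the data used in the experiments), and returning 0.0 for an empty set is merely a convention adopted here.

```python
def correctness(returned, expected):
    """Fraction of the returned properties that humans also listed."""
    return len(returned & expected) / len(returned) if returned else 0.0


def completeness(returned, expected):
    """Fraction of the human-provided properties that were returned."""
    return len(returned & expected) / len(expected) if expected else 0.0


# Made-up sets of synonymies, represented as pairs of names.
expected = {("customer", "client"), ("shop", "store"), ("pubYear", "year")}
returned = expected | {("song", "title")}  # one spurious property

c = correctness(returned, expected)    # 3 of the 4 returned pairs are right
k = completeness(returned, expected)   # all 3 expected pairs were found
```

A low severity level behaves like this toy case: every expected property is found, at the price of some spurious ones.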

As for severity level 0, we have obtained a correctness equal to 0.88 and a
completeness equal to 1.00.

Actually, the intrinsic characteristics of our algorithm led us to think that, if
the severity level increases, the correctness increases as well, whereas the
completeness decreases. In order to verify this idea, we have performed a second
series of experiments devoted to measuring correctness and completeness as the
severity level varies. Table 2 shows the results obtained up to a severity level
equal to 3; for higher severity levels, variations of correctness and
completeness are not significant.



Severity Level   Correctness   Completeness
Level 0          0.88          1.00
Level 1          0.97          0.81
Level 2          0.97          0.78
Level 3          0.97          0.73

Table 2. Correctness and Completeness of our approach at various severity
levels



Results presented in Table 2 confirmed our intuitions. Indeed, at severity
level 1, correctness increases by 0.09 (from 0.88 to 0.97) whereas completeness
decreases by 0.19 (from 1.00 to 0.81) w.r.t. severity level 0. As for severity
levels greater than 1, we have verified that correctness does not increase
whereas completeness slightly decreases w.r.t. level 1.
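
The level-to-level variations can be checked directly from the figures of Table 2, as in the following short sketch; the names `TABLE_2` and `deltas` are introduced here for illustration only.

```python
# Values copied from Table 2: severity level -> (correctness, completeness).
TABLE_2 = {0: (0.88, 1.00), 1: (0.97, 0.81), 2: (0.97, 0.78), 3: (0.97, 0.73)}


def deltas(level):
    """Absolute change of (correctness, completeness) w.r.t. the previous level."""
    prev_corr, prev_comp = TABLE_2[level - 1]
    cur_corr, cur_comp = TABLE_2[level]
    return round(cur_corr - prev_corr, 2), round(cur_comp - prev_comp, 2)


changes = {level: deltas(level) for level in (1, 2, 3)}
```

The only change in correctness occurs between levels 0 and 1, while completeness keeps decreasing by a few hundredths at each further level.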

In our opinion such a result is extremely relevant; indeed, it allows us to
conclude that, in informal situations, the right severity level is 0 whereas, in
more formal contexts, the severity level must be at least 1.



After this, we have computed the variations of the time required for deriving
interscheme properties caused by an increase of the severity level. The results
obtained are shown in Table 3. In the table, the value associated with severity
level $i$ ($1 \le i \le 3$) is to be understood as the percentage of additional
time required w.r.t. severity level $i - 1$.



Severity Level   Time Increase
Level 1          56%
Level 2          14%
Level 3          20%

Table 3. Increase of the time required by our approach at various severity levels



Table 3 shows that the increase of time required for computing interscheme
properties when the algorithm passes from severity level 0 to severity level 1
is significant. Conversely, further severity level increases do not lead to
comparably large increases of the time necessary for computing interscheme
properties. This observation further confirms the results obtained by the
previous experiments, i.e., that the most relevant differences in the results
obtained by applying our approach can be found between severity levels 0 and 1.
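
Since each value of Table 3 is relative to the previous severity level, the overall slowdown w.r.t. level 0 is obtained by compounding the increases. The following sketch shows the arithmetic; the names `INCREASE` and `cumulative_factor` are introduced here for illustration, and the compounding itself is an inference from the definition of the table entries rather than a figure reported in the experiments.

```python
# Values copied from Table 3: additional time w.r.t. the previous level.
INCREASE = {1: 0.56, 2: 0.14, 3: 0.20}


def cumulative_factor(level):
    """Time at `level` divided by the time at severity level 0."""
    factor = 1.0
    for i in range(1, level + 1):
        factor *= 1.0 + INCREASE[i]
    return factor


factors = {level: cumulative_factor(level) for level in (0, 1, 2, 3)}
# factors[3] is about 2.13: under this reading, level 3 takes roughly twice
# the time of level 0, and more than half of the overall increase is already
# paid when moving from level 0 to level 1.
```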

6 Related Work 

In the literature many approaches for performing interscheme property
extraction and data source integration have been proposed. Even if they are
quite numerous and varied, to the best of our knowledge, none of them offers
the possibility of choosing a "severity" level against which the various
activities are carried out. In this section we examine some of these approaches
and highlight their similarities and differences w.r.t. our own.

In [16] an XML Schema integration framework is proposed. It consists of
three phases, namely pre-integration, comparison and integration. After this,
conflict resolution and restructuring are performed for obtaining the global
refined Schema. To the best of our knowledge the approach of [16] is the closest
to our own. In particular, (i) both of them are rule-based [17]; (ii) both of
them assume that the global Schema is formulated in a referenced style rather
than in an inline style (see [T] for more details); (iii) integration rules
proposed in [16] are quite similar to those characterizing our approach. The
main differences existing between them are the following: (i) the approach of
[16] requires a preliminary translation of an XML Schema into an XSDM Schema;
such a translation is not required by our approach; (ii) the integration task in
[16] is graph-based and object-oriented whereas, in our approach, it is directly
based on x-components.

In [TU] the system XClust is presented whose purpose is XML data source
integration. More specifically, XClust determines the similarity degrees of a
group of DTDs by considering not only the corresponding linguistic and
structural information but also their semantics. It is possible to recognize
some similarities between our approach and XClust; in particular, (i) both of
them have been specifically conceived for operating on XML data sources (even if
our approach manages XML Schemas whereas XClust operates on DTDs); (ii) both of
them consider not only linguistic similarities but also semantic ones. There are
also several differences between the two approaches; more specifically, (i) to
perform the integration activity, XClust requires the support of a hierarchical
clustering whereas our approach adopts schema matching techniques; (ii) XClust
represents DTDs as trees; as a consequence, element neighborhoods are quite
different from those constructed by our approach; (iii) XClust exploits some
weights and thresholds whereas our approach does not use them; as a consequence,
XClust provides more refined results but these are strongly dependent on the
correctness of a tuning phase devoted to setting weights and thresholds.

In [13] the system Rondo is presented. It has been conceived for integrating
and manipulating relational schemas, XML Schemas and SQL views. Rondo
exploits a graph-based approach for modeling information sources and a set of
high-level operators for matching the obtained graphs. Rondo uses the Similarity
Flooding Algorithm, a graph-matching algorithm proposed in [12], to perform
the schema matching activity. Finally, it merges the involved information
sources according to three steps: Node Renaming, Graph Union and Conflict
Resolution. There are important similarities between Rondo and our approach;
indeed both of them are semi-automatic and exploit schema matching techniques.
The main differences existing between them are the following: (i) Rondo is
generic, i.e., it can handle various kinds of information sources; conversely,
our approach is specialized for XML Schemas; (ii) Rondo models the involved
information sources as graphs whereas our approach directly operates on XML
Schemas; (iii) Rondo exploits a sophisticated technique (i.e., the Similarity
Flooding Algorithm) for carrying out schema matching activities [12]; as a
consequence, it obtains very precise results but is time-expensive and requires
heavy human feedback; on the contrary, our approach is less sophisticated but is
well suited when the involved information sources are numerous and large.

In [6] an XML-based integration approach, capable of handling various source
formats, is presented. Both this approach and our own operate on XML
documents and carry out a semantic integration. However, (i) the approach of [6]
operates on DTDs and requires translating them into an appropriate formalism
called ORM/NIAM [9]; conversely, our approach directly operates on XML
Schemas; (ii) the global Schema constructed by the approach of [6] is
represented in the ORM/NIAM formalism whereas our approach directly returns a
global XML Schema; (iii) the approach of [6] is quite complex to apply when the
involved sources are numerous.

In [18] the DIXSE (Data Integration for XML based on Schematic Knowledge)
tool is presented, aiming at supporting the integration of a set of XML
documents. Both DIXSE and our approach are semantic and operate on XML
documents; both of them exploit structural and terminological relationships for
carrying out the integration activity. The main differences between them reside
in the interscheme property extraction technique; indeed, DIXSE requires the
support of the user whereas our approach derives the properties almost
automatically. As a consequence, results returned by DIXSE could be more precise
than those provided by our approach but, when the number of sources to integrate
is high, the effort DIXSE requires of the user might be particularly heavy.

In [I] a machine learning approach, named LSD (Learning Source Description),
for carrying out schema matching activities, is proposed. It has also been
extended to ontologies in GLUE [5J. LSD requires quite heavy support from
the user during the initial phase, for carrying out training tasks; however,
after this phase, no human intervention is required. Both LSD and our approach
operate mainly on XML sources. They differ especially in their purposes;
indeed, LSD aims at deriving interscheme properties whereas our approach has
been conceived mainly for handling integration activities. In addition, as far
as interscheme property derivation is concerned, it is worth observing that LSD
is "learner-based" whereas our approach is "rule-based" [17]. Finally, LSD
requires heavy human intervention at the beginning and, then, is automatic;
conversely, our approach does not need a pre-processing phase but requires
human intervention at the end for validating the obtained results.

In [5] the authors propose COMA (COmbining MAtch), an interactive and
iterative system for combining various schema matching approaches. The
approach of COMA appears orthogonal to our own; in particular, our approach
could inherit some features from COMA (as an example, the idea of operating
iteratively) for improving the accuracy of its results. As for an important
difference between the two approaches, we observe that COMA is generic, since it
handles a large variety of information source formats; conversely, our approach
has been specifically conceived to handle XML documents. In addition, our
approach requires the user to specify only the severity level; conversely, in
COMA, the user must specify the matching strategy (i.e., the desired matchers to
exploit and the modalities for combining their results).



7 Conclusions 

In this paper we have proposed an approach for the integration of a set of XML
Schemas. We have shown that our approach is specialized for XML documents,
is almost automatic, semantic and "light" and allows the choice of the
"severity" level against which the integration activity must be performed. We
have also illustrated some experiments we have carried out to test its
computational performance and the quality of the results it obtains. Finally, we
have examined various other related approaches previously proposed in the
literature and we have compared them with ours by pointing out similarities and
differences.

In the future we plan to exploit our approach in various other contexts that
typically benefit from information source integration, such as Cooperative
Information Systems, Data Warehousing, Semantic Query Processing and so on.
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